THE PROVISION OF SUBJECT DATA ON EUDISED RECORDS



1. Terminology

1.1 The word 'data' is now frequently treated as a singular
English noun rather than as a Latin plural (see, for instance,
many of the contributions to 'Encyclopaedia of linguistics,
information and control'). This usage will be followed
throughout this study.

1.2 Mills has drawn a convenient distinction between two broad classes
of subject data :

a) data which gives a direct indication of a document's subject
content, e.g. classmarks, controlled terms selected from a
thesaurus, 'free language' terms extracted from a document's text

b) data which only indicates subject content in an oblique or
indirect manner, e.g. the bibliographic citations appended to a
document, which are sometimes taken - at least, for the purposes
of citation indexing - as being subject-indicative.

Only the types of subject data falling within category (a) above will be
considered here.

1.3 The expression 'indexing system' will here be used in a general sense,
to signify any system (of whatever kind - classification, thesaurus,
etc.) for providing documents with subject data.

2. The Scope of the Present Study

2.1 The overall aim of this study is to make certain tentative suggestions
as to the types of subject data which might be provided on EUDISED
records, so as to enable these records to be used as the basis for a
wide range of effective information services.

2.2 Strictly, it is not possible to discuss the provision of subject data
on machine-readable records without some thought as to how this data
will be created and used. This raises such matters as indexing
policies (e.g. as to exhaustivity of indexing), data processing
capabilities, and search techniques. Since these topics cannot be
dealt with in detail here, they will only be mentioned where it is
felt that they have a direct bearing on the decision as to the type
of subject data best fitted to a particular purpose.



3. The Conceptual Approach 

3.1 This study will first set out to identify the types of information
service and search facility which might be offered by a modern
information system utilising the EUDISED data base. A distinction will
be drawn between 'retrieval searches' and 'browsing searches'.

3.2 Various types of indexing system will be defined, and compared in 
respect of their performance potential for retrieval search. The 
comparison will be made on the basis of six performance criteria. 

3.3 A possible method for providing a multilingual browsing facility in
bibliographic tools will be suggested. This leads to a general
consideration of the problems of providing multilingual access in
information systems. Finally, a tentative strategy is outlined for
providing EUDISED records with subject data.

4. Information Services Based on EUDISED Records

4.1 The first requirement is to identify the types of information service
which might make use of subject data on EUDISED records. According to
Thompson, an information system can be regarded as having three main
functions : current awareness, retrospective search and the compilation
of literature surveys. The last of these functions will be disregarded
here, since it does not directly involve subject data. The main types
of current awareness and retrospective search services one would expect to
find in a modern computer-based information system are as follows :

a) An On-line Search Service

In order to keep computing costs and telecommunication charges
within reasonable bounds, a limit is normally set on the number
of citations a user may retrieve on-line. Often, only part
of the database is accessible on-line - the older portion of the
main file being available solely for off-line search.

(A typical example of this approach is to be found in
MEDLARS (MEDical Literature Analysis and Retrieval System)
which now has a data base of more than 2,000,000 records. At
the time of writing, only about 600,000 of these records can
be accessed through MEDLINE (MEDLARS on-line). UK MEDLARS
restricts the output for an on-line search to a maximum of 25
citations.)

b) An Off-line Retrospective Search Service

Off-line searching is applicable when :

i) a user is unable to search on-line - perhaps through lack
of access to appropriate telecommunication facilities.

ii) a comprehensive search is required, involving back-files
which are not available on-line.

iii) the number of citations to be retrieved exceeds the
permitted limit for on-line searches.

c) Selective Dissemination of Information (SDI)

The services so far described primarily cater for 'one-off'
searches. SDI provides for repetitive searching in which user
profiles are matched, at regular intervals, against the
most current portion of the database. Profile matching is
normally performed as an off-line batch processing operation.

d) Recurrent Bibliographies

These might be of two types :

i) a bibliography covering the total intake to the database.
Such a bibliography would typically be issued quite
frequently (e.g. monthly), with occasional cumulations
(say, six-monthly and annual) (cf. 'Current Index to
Journals in Education')

ii) bibliographies covering certain defined subsets of the total
database, e.g. documents of a particular form (say, audio-visual
materials); documents in a particular subject area
(say, 'The education of the handicapped') (cf. the special
bibliographies produced by some ERIC (Educational Resources
Information Center) clearinghouses).

e) Current Awareness Bulletins

Like the recurrent bibliographies, these might be of two
types :

i) a bulletin covering all recent additions to the
database [4]

ii) separate bulletins covering various types of recent
additions

Current awareness bulletins are normally issued with high
frequency (say, weekly or fortnightly).
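
As an illustration of the SDI service described under (c) above, the following is a minimal sketch in Python, not a description of any actual EUDISED procedure. It assumes, purely for illustration, that a user profile is a set of index terms and that a record matches a profile if it carries at least one of those terms. The names `match_profiles`, `profiles` and `new_records` are hypothetical, as are the sample data.

```python
def match_profiles(profiles, new_records):
    """Match each stored user profile against the newest batch of records.

    profiles:    dict mapping user name -> set of index terms (assumed form)
    new_records: dict mapping record number -> set of index terms
    Returns a dict mapping user name -> sorted list of matching record numbers.
    """
    notifications = {}
    for user, wanted in profiles.items():
        # A record matches if it shares at least one term with the profile.
        hits = [number for number, terms in new_records.items()
                if wanted & terms]
        notifications[user] = sorted(hits)
    return notifications


# Hypothetical data: two user profiles and the latest intake to the database.
profiles = {
    "user_a": {"Curriculum", "Universities"},
    "user_b": {"Audio-visual materials"},
}
new_records = {
    174: {"Students", "Universities", "Attitudes", "Curriculum"},
    175: {"Secondary schools", "Personnel"},
}

print(match_profiles(profiles, new_records))
```

Because the profiles are stored and the matching needs no interaction with the user, a run of this kind lends itself to the off-line batch processing mentioned in the text.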

5. Types of Search

5.1 Searches will here be divided into 'retrieval searches' and 'browsing
searches'. The former are conventional literature searches in which
the user has at least some notion - albeit a vague one - of the type
of document he is seeking. 'Browsing' characterises the behaviour
of a user whose search philosophy might be summed up as, 'I don't know
what I'm looking for, but I shall know it when I see it'.

6. The Relationship Between Types of Search and Types of Service

6.1 The table below relates type of search to type of information service.

                                          TYPES OF SEARCH
TYPES OF SERVICE                          Retrieval searches   Browsing

On-line search service                            /
Off-line retrospective search service             /
SDI                                               /
Recurrent bibliographies                          /                /
Current awareness bulletins                                        /

A '/' indicates that a service makes provision for a particular type of
search; a blank indicates that it does not.

6.2 Some explanation is required of the grounds for assuming in the
table that most information services only cater for a particular
type of search.

a) On-line Searching

It would be uneconomical - in terms of computer time and
telecommunication charges - to allow protracted browsing
searches to be conducted on-line.

b) Off-line Searching (retrospective, and SDI)

Browsing is obviously impossible when searching off-line, since
neither the index nor the citations file is visible to the
searcher during the search process.

c) Current Awareness Bulletins

On the assumption that

i) current awareness bulletins are published sufficiently
frequently to ensure that each issue contains only a
relatively small number of citations,

ii) these citations are arranged in subject groupings,

a user should be able to identify possibly relevant citations
by browsing quickly through the appropriate section(s) of a
bulletin. The implication is that current awareness bulletins
need not provide for retrieval searches, and so can dispense
with subject indexes. This approach seems particularly justified
if bulletins are later to be replaced by recurrent bibliographies
with full subject indexes.

6.3 At this point, we shall temporarily set aside any further mention of
browsing searches; this topic will be taken up again in Section 10.
Sections 7-9 are given over to a consideration of the types of subject
data appropriate to retrieval searches.

7. Performance Criteria Applicable to Retrieval Searches

7.1 Information systems are generally judged in terms of :

a) effectiveness, i.e. the degree to which they satisfy users'
requirements

b) efficiency, i.e. the degree to which they satisfy the management
requirements that they be economical to establish and operate
7.2 Effectiveness Criteria 

7.2.1 A system is normally deemed effective if it performs well in the
following respects :

Recall
Precision
Coverage
Currency
Response time
Ease of use
Appropriateness of form of output

This is a somewhat modified and extended version of the well known
list of performance criteria suggested by Lancaster [5]. Since a system's
'coverage', 'currency' and 'appropriateness of form of output' are not
directly dependent upon the types of subject data it uses, these criteria
will be ignored here. This leaves 'Recall', 'Precision', 'Response time'
and 'Ease of use' for further consideration.

7.2.2 'Recall' and 'Precision' are here used in a general sense as what Snyder
has called 'criterion concepts' rather than as the names of particular
quantitative measures of retrieval performance. A system's recall is its
ability to retrieve relevant citations; its precision is its ability to
avoid the retrieval of non-relevant citations. The notions of recall and
precision seem applicable to the whole spectrum of retrieval searches,
though not to browsing, since this is something of a 'lucky dip' search
technique in which neither high recall nor high precision is expected.
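
Although this study uses the two terms only as criterion concepts, it may help to see the ratio forms in which they are conventionally expressed: recall as the proportion of the relevant citations in the file that a search retrieves, and precision as the proportion of the retrieved citations that are relevant. The sketch below is for orientation only; the function name and the sample figures are hypothetical.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    retrieved: citation numbers output by the search
    relevant:  citation numbers in the file judged relevant to the enquiry
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant AND retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 4 citations retrieved, 5 relevant citations in the
# file, 3 citations common to both.
recall, precision = recall_and_precision({1, 2, 3, 4}, {2, 3, 4, 8, 9})
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

In this invented case the search finds three of the five relevant citations (recall 0.60) and one of its four retrieved citations is non-relevant (precision 0.75).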

7.2.3 'Response time' will here be interpreted as 'search effort', since
this is the component of response time which is particularly
dependent upon the nature of the subject data a system provides.

7.2.4 'Ease of use' is a subjective factor which varies from user to
user, and which is likely to be at least partly reflected in search
time, i.e. there is probably a strong correlation between ease of
use and speed of use. However, one aspect of 'ease of use' warrants
separate consideration : this is the degree to which a system may
be used without the need for the user to comply with a variety of
special system-imposed protocols and conventions. Watt has called
this property of a system its 'habitability'. Although this term
is somewhat unfamiliar it will be adopted here in the absence of
any obviously better alternative.

7.3 Efficiency Criteria

7.3.1 The following efficiency criteria are particularly relevant to
information systems providing retrieval facilities :

Indexing effort
Vocabulary maintenance effort

Efficiency improves as the effort devoted to indexing and vocabulary 
maintenance decreases. 

7.4 Search effort, habitability, indexing effort, and vocabulary maintenance
effort are all, in an obvious sense, subsidiary to recall and precision, 
since unless a system has at least some success in retrieving relevant 
documents and avoiding non-relevant ones, its performance in other 
respects is only of academic interest. 

8. Types of indexing system

8.1 This section will identify various types of indexing system which might
be used to provide subject data for EUDISED records. Systems will here
be discussed in terms of :

a) the method of term co-ordination they use : pre-co-ordinate
or post-co-ordinate.

b) their type of vocabulary

8.2 Pre- and post-co-ordinate systems

8.2.1 Pre-co-ordinate systems

In a pre-co-ordinate system, terms are co-ordinated (i.e. combined)
at the time of indexing, so that each index entry serves as a kind
of 'telegraphic' statement of the subject indexed, e.g.

Students. Universities. Great Britain
   Attitudes to curriculum - Surveys        174

Pre-co-ordinate index entries of this type show the various 'contexts'
in which each lead term in the index (i.e. 'Students' in the example
above) has occurred. This contextual information helps the user to
decide whether or not the documents to which a particular entry refers
are likely to be relevant to his enquiry.

8.2.2 Pre-co-ordinate systems which use controlled vocabularies are
conventionally further characterised as being either synthetic or
enumerative. In synthetic systems, compound subjects are specified
by selecting terms from a thesaurus or classification schedule and
combining these, according to a preferred citation order, into a
pre-co-ordinated 'string'. Depending upon the type of vocabulary
employed, this string will normally be more or less subject-co-extensive.
An enumerative system, on the other hand, attempts to supply ready-made
subject headings (or class numbers) for all subjects which might form
the focus of a user's enquiry. In practice the headings provided tend
to be considerably less specific than many of the subjects they are
required to convey - so much so that, even for monograph indexing,
the adequacy of enumerative systems is now seriously in doubt. For
this reason systems of this type will not be discussed further in
the context of retrieval searches (though it will later be suggested that
they may have some merit as 'browsing schemes'). Their inability to
provide satisfactory search keys for the wide range of highly-specific
subjects likely to occur in EUDISED materials is taken to be self-evident.
The expression 'pre-co-ordinate system' will henceforth be used to
mean a 'synthetic pre-co-ordinate system'.

8.2.3 Post-co-ordinate systems

In a post-co-ordinate index, each term appears in isolation, e.g.

Attitudes        Curriculum       Great Britain
174              174              174

Students         Surveys          Universities
174              174              174

and although it is possible to combine terms at the time of search, e.g.

Students and Attitudes

the searcher cannot tell the various contexts in which any particular
combination of terms has occurred.
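
The post-co-ordinate arrangement just described can be sketched as a file of postings (term to document numbers) with co-ordination carried out at search time. This is a minimal illustration in Python rather than a description of any particular system; the function names are hypothetical, and document 175 is an invented second record added so that the search has more than one candidate.

```python
from collections import defaultdict


def build_postings(documents):
    """Build a post-co-ordinate index: each term -> set of document numbers."""
    postings = defaultdict(set)
    for number, terms in documents.items():
        for term in terms:
            postings[term].add(number)
    return postings


def search_and(postings, *terms):
    """Co-ordinate terms at search time (logical product of their postings)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


documents = {
    174: ["Students", "Universities", "Great Britain",
          "Attitudes", "Curriculum", "Surveys"],
    175: ["Students", "Attitudes", "Examinations"],   # hypothetical record
}
postings = build_postings(documents)

print(sorted(search_and(postings, "Students", "Attitudes")))
```

The search returns document numbers only. Nothing in the postings shows how 'Students' and 'Attitudes' are related in each document, which is exactly the contextual information a pre-co-ordinate entry would display.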

8.2.4 The use of contextual information, as a means of making relevance
judgements in the course of a search, depends upon the searcher's
ability to interact with the index. This interaction is only possible
if the index is visually displayed - as in an on-line search, or a
search of a bibliography. Where the index is not visible to the
searcher - as in the case of an off-line search - all searches have
of necessity to be conducted in a post-co-ordinate manner. It should
be noted that whereas a post-co-ordinate search may be performed in a
pre-co-ordinate index, simply by ignoring the contextual information
the index provides, post-co-ordinate indexes cannot be treated as
though they were pre-co-ordinate. It is not possible, at the time
of search, to arrange unordered lists of terms into pre-co-ordinated
'strings' without the fear that the resulting index will contain many
entries which inaccurately or ambiguously represent the subject they
ought, ideally, to convey.

8.3 Types of vocabulary

8.3.1 Uncontrolled vocabularies

Included in this category are :

a) vocabularies consisting of terms selected (either manually
or automatically) from titles, abstracts, full text, or
any combination of these.

b) vocabularies consisting of all the words appearing in the titles
of documents, or in their titles and abstracts. Retrieval
activities based on this kind of vocabulary are normally
referred to as 'free-text searches' - the technique employed,
for instance, in IBM's ITIRC system [9]. The possibility of
searching the full texts of documents (as is done, for instance,
in some legal text retrieval systems such as STATUS) will not
be considered here. It will be assumed that, for cost reasons
alone, it would be totally impracticable to convert the full
text of every document in the EUDISED data base to machine-readable
form, store this data, and provide a random access
search facility on all significant words.

c) vocabularies consisting of terms freely assigned by indexers
working without the constraints of any kind of authority list
of terms, e.g. the 'free indexing' practised by INSPEC [11].

8.3.2 Open-ended controlled vocabularies

Systems of this type :

a) provide terms which are co-extensive with the concepts they are
intended to convey. Since, in the course of time, new concepts
emerge in the literature of a subject field, adherence to the
philosophy of maximum specificity in indexing and searching
implies the use of an open-ended vocabulary to which new terms
can be added as and when new concepts are encountered by the
indexer.

b) introduce controls into the vocabulary to avoid the semantic scatter
which occurs if the same concept is expressed by two or more different
terms. Where two or more terms are equivalent in meaning for
retrieval purposes, i.e. they represent the same concept (e.g.
'Employees', 'Staff', 'Personnel') one is chosen as the preferred
term (say 'Personnel'). The other terms are given the status of
'forbidden' terms, and 'See' or 'Use' references made from them to
the preferred term. This procedure is calculated to promote recall
by ensuring that indexers and searchers achieve a coincidence of

8.3.3 Fixed vocabularies

Vocabularies of this type are controlled but, by comparison with
those of the open-ended kind, are relatively fixed, i.e. they are
only occasionally updated, when new editions of the vocabulary
are published. In consequence newly-emergent concepts have
sometimes to be expressed by whatever term represents the
'nearest generic head'. This practice simplifies vocabulary
maintenance, but introduces a danger of loss of precision in
retrieval.

Non-specific controlled vocabularies range from the 'almost
specific' to the 'very broad'. Those at the broad end of the
spectrum - such as the limited vocabularies at one time in
vogue amongst post-co-ordinate feature card systems - will be
disregarded here. Vocabularies of this type generally perform
with low precision (see, for instance, [12]), and are therefore
poorly equipped to provide for retrieval searches in large-scale
computer-based information systems.



8.4 The two factors described above - method of co-ordination and type
of vocabulary - can be combined to produce a classification of
six types of indexing system :

                            METHOD OF TERM CO-ORDINATION
TYPE OF VOCABULARY          Pre-co-ordinate               Post-co-ordinate

Uncontrolled                e.g. title-based KWIC         e.g. free text
vocabularies                indexes

Open-ended controlled       e.g. PRECIS                   e.g. Excerpta Medica
vocabularies                                              system

Closed                      e.g. faceted classification   e.g. EUDISED thesaurus
vocabularies                schemes

9. The performance potential of various types of indexing system

9.1 In Section 6.1, it was suggested that information systems commonly
made provision for retrieval searches in the following situations :
on-line searching, off-line retrospective searching, SDI, and the
searching of recurrent bibliographies. These four situations can
conveniently be reduced to two :

a) post-co-ordinate search situations. These are the situations
in which the index is, of necessity, 'hidden' from the user, so
that all searching must be performed in a post-co-ordinate manner,
ignoring any contextual information provided by pre-co-ordinate
index entries.

b) searches of visually-readable indexes, i.e. situations in which
either pre- or post-co-ordinate searching is possible; although,
as previously noted (Section 8.2.4), pre-co-ordinate searching
presupposes the availability of a pre-co-ordinate index.

Off-line searches, whether of the retrospective or SDI variety, are
clearly of type (a), while searches of recurrent bibliographies
qualify as type (b). On-line searches may be of type (a) or (b),
depending upon the search facilities available. At present, most
on-line systems provide only for post-co-ordinate searching, though
there is no reason in principle why they should not allow the on-line
display of pre-co-ordinate index entries.

9.2 The types of indexing system defined in the last section will now be
compared in respect of their ability to provide EUDISED records with
suitable subject data. The comparison will involve an assessment of
the performance potential of the various systems, both in post-co-ordinate
search situations, and when searching visually-readable indexes.
Performance will be judged in terms of the six criteria previously
defined (see Section 7.4).

9.3 Recall

9.3.1 The retrieval tests conducted to date suggest that a system's
recall performance is not significantly affected by the nature
of the terms in its vocabulary, e.g. their specificity or
method of co-ordination. Given adequate indexing and searching
decisions any system can attain good recall, provided that its
vocabulary incorporates the necessary 'recall devices'. Recall
devices - such as the confounding of synonyms, or the display of
hierarchical relationships - allow searches to be expanded so that
they retrieve more documents. The expansion is accomplished in a
systematic manner so that there is a high probability that at least
some of the additional documents captured will be relevant to the user's
enquiry, and that recall will thereby be improved.

9.3.2 Recall devices may be mandatory, e.g.

Staff See Personnel

or optional, e.g.

Schools
   See also
   Secondary schools

In principle, all systems may embody the same recall devices, but
different systems are forced to adopt these devices in different ways.
For instance, in an uncontrolled vocabulary, recall devices are always
optional (since to make any of them mandatory would be to introduce
an element of vocabulary control). In an open-ended controlled
vocabulary, the use of a preferred term as a substitute for one or
more forbidden terms of equivalent meaning is mandatory - it is binding
on the searcher; other semantic links between terms are shown as
optional routes for search expansion. Fixed vocabularies tend to have
a higher percentage of mandatory devices than open-ended vocabularies,
since some species/genus links are treated as mandatory.
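
The difference between mandatory and optional recall devices can be made concrete with a short sketch. It assumes, for illustration only, that 'See' references are held as a table from forbidden term to preferred term and 'See also' references as a table from a term to its suggested expansions. The tables reuse the examples given above; the names `SEE`, `SEE_ALSO` and `expand` are hypothetical.

```python
# Hypothetical vocabulary records, using the examples quoted in the text.
SEE = {"Staff": "Personnel", "Employees": "Personnel"}   # mandatory devices
SEE_ALSO = {"Schools": ["Secondary schools"]}            # optional devices


def expand(term, use_optional=False):
    """Return the terms actually searched for a term entered by the user."""
    # Mandatory: a forbidden term is always replaced by its preferred term.
    preferred = SEE.get(term, term)
    terms = [preferred]
    # Optional: followed only if the searcher chooses to widen the search.
    if use_optional:
        terms.extend(SEE_ALSO.get(preferred, []))
    return terms


print(expand("Staff"))                       # mandatory substitution
print(expand("Schools"))                     # no expansion requested
print(expand("Schools", use_optional=True))  # optional expansion taken
```

In an uncontrolled vocabulary, on this view, every link would sit in the optional table, since nothing obliges the searcher to follow it.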

9.3.3 It will here be assumed that all of the types of system under
review could be equipped with the full range of recall devices.
Recall potential will, therefore, be eliminated as a factor in
the present comparisons. Interest can, however, be focussed
on a related issue : how does the way a system is obliged, through
the nature of its vocabulary, to adopt recall devices affect the
effort required for indexing, searching and vocabulary maintenance?

9.4 Precision

9.4.1 Post-co-ordinate search situations

Where only post-co-ordinate searching is possible, a system's precision
performance is largely dependent upon the specificity of its terms;
this situation favours open-ended controlled vocabularies and uncontrolled
vocabularies. The latter have been shown to perform surprisingly well
for SDI searches in the field of education [15]. In theory, their
precision potential is lower than that of open-ended controlled
vocabularies, if only because of their inability to distinguish between
the different meanings of homographs. Of the various precision devices
adopted by post-co-ordinate systems, roles and links have generally
been shown to be ineffective, and expensive to apply [16]. Weighting,
on the other hand - at the simple level of distinguishing between core
terms and subsidiary terms - is almost certainly beneficial.

9.4.2 Searches of visually-readable indexes

In this area pre-co-ordinate systems have a clear advantage over
post-co-ordinate ones, through their ability to show the searcher the various
contexts in which his search terms have appeared. Two tests [12, 17] have
provided clear evidence to support the view that searchers can use the
contextual information provided by pre-co-ordinate systems to avoid the
retrieval of non-relevant documents and so achieve dramatic precision
improvements over post-co-ordinate systems.

9.5 Search effort

9.5.1 Post-co-ordinate search situations

Uncontrolled vocabularies generally demand the greatest expenditure of
effort in the compilation of search profiles, since the searcher must
take account of all the variant ways in which the search concepts may
be represented in the data base. In practice, the amount of effort
required may be reduced if the system provides suitable search options,
e.g. a term truncation facility.
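
A term truncation facility of the kind just mentioned lets the searcher enter a word stem once instead of listing every variant form. A minimal sketch follows, assuming right-hand truncation only and a postings file held as a dictionary; the function name and the sample postings are hypothetical.

```python
def truncation_search(postings, stem):
    """Return document numbers posted under any term beginning with `stem`."""
    found = set()
    for term, numbers in postings.items():
        # Compare case-insensitively so 'Curricul' and 'curricul' behave alike.
        if term.lower().startswith(stem.lower()):
            found |= numbers
    return found


# Hypothetical free-text postings showing variant forms of one concept.
postings = {
    "curriculum": {174},
    "curricula": {201},
    "curricular reform": {305},
    "current awareness": {410},
}

print(sorted(truncation_search(postings, "curricul")))
```

One stem picks up all three variant forms while leaving 'current awareness' alone. A shorter stem such as 'curr' would retrieve it as well, so the saving in search effort is bought at some risk to precision.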

A procedure adopted by some post-co-ordinate hierarchically
structured vocabularies to reduce the effort required to perform
'species' searches is 'upward posting'. By this procedure a search
on a generic term retrieves all documents which are indexed by that
term, or by any of its species. Upward posting is perhaps of most
obvious value in feature card systems, where it may save the searcher
much effort in card manipulation. It does not warrant further
consideration in the present context, since, despite its use in
ENDS (the EURATOM Nuclear Documentation System), there are other
more flexible ways of conducting 'species' searches in large
machine-held files (e.g. the 'explode' facility in MEDLARS).
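
The alternative to upward posting can be sketched as an expansion carried out at search time: the generic term is replaced by itself plus all of its species, and the postings of each are merged. The sketch below is a generic illustration of that idea and is not a description of how MEDLARS implements its 'explode' facility. The hierarchy fragment, the postings and the function names are all hypothetical, and the hierarchy is assumed to contain no cycles.

```python
# Hypothetical fragment of a hierarchy: generic term -> immediate species.
NARROWER = {
    "Schools": ["Primary schools", "Secondary schools"],
    "Secondary schools": ["Comprehensive schools"],
}


def explode(term):
    """Return the term together with all of its species, at every level."""
    terms = [term]
    for species in NARROWER.get(term, []):
        terms.extend(explode(species))   # recurse down the hierarchy
    return terms


def species_search(postings, term):
    """Merge the postings of a generic term and all of its species."""
    found = set()
    for t in explode(term):
        found |= postings.get(t, set())
    return found


postings = {
    "Schools": {12},
    "Secondary schools": {34, 56},
    "Comprehensive schools": {78},
    "Universities": {174},
}

print(explode("Schools"))
print(sorted(species_search(postings, "Schools")))
```

Because the expansion happens at search time, each document is posted only under the terms actually assigned to it, and the searcher remains free to search the generic term alone when a 'species' search is not wanted.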

9.5.2 Searches of visually-readable indexes

Three points may be made about the types of index suited to visual
searching :

a) Uncontrolled indexes of any kind (pre- or post-co-ordinate) are
inappropriate - they require the searcher to carry out too many
'look-up' operations to compensate for the semantic scatter
present in the index. Vickery has stated categorically 'Uncontrolled
text words ... are suitable only for machine search'.

b) Post-co-ordinate indexes may also be said to be unsuited to this
kind of search, not only because of their inferior precision
potential to pre-co-ordinate indexes, but also on the grounds of
search effort. They typically present the searcher with a list
of postings under each term, and require him to perform Boolean
operations on these lists (logical sum, product, and difference)
to identify the document numbers satisfying his search prescription.
These operations can be performed with great speed and accuracy
by machine, but are highly time-consuming and error prone when
carried out manually. In the opinion of this writer, the
inappropriateness of using post-co-ordinate systems to provide
indexes to printed bibliographies is clearly demonstrated in the
index to the EUDISED R&D Bulletin [20].

c) Pre-co-ordinate controlled indexes are most suited to
visual searching. Search time is reduced if :

i) the index consists of a single sequence, in which
citations are entered directly under index entries
(as in the BTI (British Technology Index) system)

ii) entries are structured according to logical rules, so
as to provide helpful subarrangement under each heading
(as in PRECIS)

iii) the index provides entries under all of the significant
terms in each pre-co-ordinated string.

9.6 Habitability

9.6.1 In only one situation can a clear preference be made, on the grounds
of habitability, for a particular type of indexing system. Where
simple on-line searches are undertaken by uninitiated users, an
uncontrolled system may prove the most habitable. Such a system
allows the user to enter a search term without the need to consult
a controlled vocabulary, and - as Lancaster has shown - gives him
a reasonable chance of retrieving at least a few relevant documents.
These may well be sufficient to satisfy the user since, if only a
portion of the database is available for on-line search (see 4.1 (a))
he will not in any case be expecting high recall.

9.6.2 In principle, there is no reason why a controlled system should not
be as habitable as an uncontrolled one, provided that it is equipped
with a full 'lead-in' vocabulary capable of mapping all forbidden terms
entered by users into the corresponding preferred terms present in the
data base.

9.7 Indexing effort 

9.7.1 There is little doubt but that one of the major attractions of uncontrolled
systems is that they can be applied with the minimum of indexing effort.
In the simplest case - a 'free-text' system - no indexing is required
at all, all significant words in titles and/or abstracts being
accepted as search keys. Where human indexers are used either
for 'free indexing' or for 'word-extraction' indexing they are
able to work quickly, being free of the constraints of a controlled
vocabulary or controlled syntax.

9.7.2 Where controlled systems are in use, post-co-ordinate indexing is
generally faster and easier than pre-co-ordinate indexing, since
in the latter case, the terms selected for a document must be
arranged into strings according to a preferred citation order.
Moreover, in post-co-ordinate indexing, the recognition of peripheral
topics in a document normally leads to the assignment of a few
additional terms, whereas in pre-co-ordinate indexing it often
requires the formulation of several new strings. In practice it
proves to be hardly feasible to use pre-co-ordinate systems for
highly exhaustive indexing (say, an average of more than 15 terms
per document).

9.7.3 Some pre-co-ordinate controlled systems possess a number of features
which are designed to reduce indexing effort, e.g.

a) a 'string input' facility whereby the indexer is only required
to formulate one string for each subject indexed, this string
being manipulated by program to provide an index entry under
each of the significant terms it contains

b) a mechanism whereby the complete network of 'See' and 'See also'
references appropriate to any term need only be recorded once
and can therefore be called up as and when required, e.g. PRECIS'
Reference Indicator Numbers (RINs).

c) a mechanism whereby, once the complete set of indexing data has
been created for a particular document, that data may be automatically
added to the records of any subsequent documents which deal with
the same subject, e.g. PRECIS' Subject Indicator Numbers (SINs).
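
The 'string input' facility described under (a) can be illustrated by the simplest possible manipulation: rotating one input string so that each term in turn becomes the lead term. This is a deliberately crude stand-in and should not be read as the PRECIS entry-generation procedure, whose rules are considerably more elaborate. The function name and the citation order chosen for the sample string are hypothetical.

```python
def entries_from_string(string, document_number):
    """Generate one index entry per term by simple rotation of the string.

    Each entry is a tuple: (lead term, remaining terms in their original
    order, document number). Entries are returned in alphabetical order
    of lead term, as they would be filed in a printed index.
    """
    entries = []
    for position, lead in enumerate(string):
        context = string[:position] + string[position + 1:]
        entries.append((lead, context, document_number))
    return sorted(entries)


# One string formulated by the indexer (citation order assumed).
string = ["Great Britain", "Universities", "Students",
          "Attitudes to curriculum"]

for lead, context, number in entries_from_string(string, 174):
    print(f"{lead}. {'. '.join(context)}  {number}")
```

The indexer writes the string once, and the program supplies the four entries. The saving grows with the number of significant terms in each string.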

9.8 Vocabulary maintenance effort

9.8.1 The problems of vocabulary maintenance are in inverse proportion
to the degree of vocabulary control. Fixed vocabularies present
few problems, unless they are subject to constant changes as new
editions are issued.

9.8.2 With open-ended controlled vocabularies, procedures are required :

a) for continuously updating the vocabulary, as new concepts
are encountered in indexing

b) for notifying indexers and searchers of all new terms added.

These tasks prove particularly arduous when the vocabulary is 'young'
and has a high growth rate, but become much less time-consuming once
the core vocabulary has been established, and the number of new terms
added begins to tail off.

9.8.3 If, as was earlier assumed, even an uncontrolled system needs a full
range of recall devices, an efficient procedure is required for managing
its vocabulary, so as to record all of the various terms by which a
particular concept may be represented in the data base. The major
problem with an uncontrolled vocabulary is essentially one of size -
particularly if the terms it contains consist of phrases rather than
individual words. INSPEC's experience in using 'free language' phrases
may be cited : 29,500 documents, indexed at an average exhaustivity
of 6.5 phrases each, gave rise to a total vocabulary of more than 80,000
unique phrases.
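
The scale of the problem follows from simple arithmetic on the figures
quoted above. The short Python calculation below uses only those figures
(29,500 documents, 6.5 phrases each, and 80,000 unique phrases taken as a
lower bound); the variable names are illustrative.

```python
documents = 29_500
phrases_per_document = 6.5
unique_phrases = 80_000  # "more than 80,000", so treated as a lower bound

# Total number of phrase assignments made by the indexers.
assignments = documents * phrases_per_document

# Because unique_phrases is a lower bound, this is an upper bound on reuse.
max_mean_reuse = assignments / unique_phrases

print(f"phrase assignments: {assignments:,.0f}")
print(f"mean uses per unique phrase: at most {max_mean_reuse:.2f}")
```

On these figures, some 191,750 phrase assignments yielded a vocabulary in
which each unique phrase was used, on average, fewer than 2.4 times.
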
9.9 Summary

The table below indicates the types of system considered 'best'
in relation to each of the six performance criteria :

PERFORMANCE POTENTIAL

Criterion               Search situation           'Best' types of system
----------------------  -------------------------  -----------------------------------
For high recall         -                          All systems potentially equal

For high precision      Post-co-ordinate search    Open-ended controlled vocabularies,
                        situations                 uncontrolled vocabularies

                        Searches of visually-      Pre-co-ordinate systems
                        readable indexes

For low search effort   Post-co-ordinate search    Controlled vocabularies
                        situations

                        Searches of visually-      Pre-co-ordinate systems with
                        readable indexes           controlled vocabularies

For habitability        -                          Uncontrolled vocabularies (for
                                                   certain types of on-line search,
                                                   see 9.6.1)

For low indexing        -                          Uncontrolled vocabularies
effort

For low vocabulary      -                          Fixed vocabularies
maintenance effort
10. Browsing schemes

10.1 In Section 6.1 it was envisaged that current awareness bulletins
and recurrent bibliographies would make provision for browsing
searches. The following conditions are either desirable or
necessary for browsing to be possible :

a) citations should be arranged in groups, in such a way that
each group represents a subject or form of document of possible
interest to users.

b) the groups should be arranged in a helpful order, which reflects
the consensus view of the broad structure of the subject field
of education (if such a view can be determined).

c) each citation should contain, or be accompanied by, an explicit
statement of the subject content of the document to which it
relates. The purpose of this statement is to help the browser
decide whether or not the document is likely to be of interest
to him.

a) and b) can be provided by a 'browsing scheme', i.e. a classification
scheme or a system of subject headings. Titles might serve as the subject
statements referred to in c), though abstracts would be preferable.
Alternatively, each citation might be equipped with a specially
constructed 'feature heading' derived from a pre-co-ordinated string
of terms (cf. the PRECIS-derived feature headings attached to citations
in the 'British National Bibliography').

10.2 The view adopted here is that the functions of browsing schemes are
as limited as indicated above. They are not required :

a) to provide for specific retrieval searches; this is the function
of subject indexes

b) to provide each citation with a unique identifying number which can
serve as the link between the citation and its subject index entries.
Such a link is required but is preferably made independent of the
browsing scheme. (It is appropriate to note, at this point, the
practice adopted in the annual volumes of 'Library and Information
Science Abstracts' : citations are arranged in classified order,
but are also given a simple running number which serves as the link
between index entries and citations).

10.3 A notional strategy will be suggested below for the use of a
browsing scheme in the form of a simple enumerative classification.
The scheme might follow the overall structure of an existing
classification for the field of education, e.g. the London Education
Classification. The plan envisaged here requires that each of
the class numbers in the scheme be associated with several subject
headings; each heading would be in a different language, but all
would express the subject represented by the class number. A
multi-lingual thesaurus containing a set of subject headings for
each class number in the browsing scheme would be available in both
printed and machine-readable form. Class numbers would be added to
EUDISED records as part of their subject data. When a batch of
records was processed to produce a bibliography or current awareness
bulletin, the class numbers occurring in that batch would be input to
a program which would access the multi-lingual thesaurus and extract
the corresponding subject headings (in whatever language had been
specified). Two courses of action are now possible :

a) create a classified file, in which each class number is 
accompanied by an explanatory subject heading 

b) create an alphabetical subject sequence, in which class numbers 
are discarded and citations are arranged directly under subject 
headings 

This strategy is of course highly tentative. It does, however, 
illustrate the possibilities of using a flexible browsing scheme
capable of providing both alphabetical and classified arrangements, and 
sensitive to the need for multi-lingual access. 
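
The two courses of action may be sketched in Python as follows. This is
an illustration under stated assumptions, not a specification: the class
numbers 'A1' and 'B2', the headings, the record layout and the function
names are all hypothetical, and are not taken from the London Education
Classification or from any EUDISED file.

```python
# Hypothetical multi-lingual thesaurus: class number -> heading per language.
THESAURUS = {
    "A1": {"en": "Primary education", "fr": "Enseignement primaire"},
    "B2": {"en": "Adult education", "fr": "Formation des adultes"},
}


def heading(class_number, language):
    """Look up the subject heading for a class number in one language."""
    return THESAURUS[class_number][language]


def classified_file(records, language):
    """Course a): class-number order, each number with an explanatory heading."""
    ordered = sorted(records, key=lambda r: r["class_number"])
    return [
        (r["class_number"], heading(r["class_number"], language), r["citation"])
        for r in ordered
    ]


def alphabetical_sequence(records, language):
    """Course b): class numbers discarded, citations filed under headings."""
    rows = [(heading(r["class_number"], language), r["citation"]) for r in records]
    return sorted(rows)
```

The same class numbers thus serve both arrangements, and a change of
output language requires only a different value of the language argument.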

11. Multi-lingualism

11.1 It is obviously possible to provide multi-lingual access to documents 
by first abstracting and indexing them in one language, and then 
translating all abstracts and index terms into several other languages. 
However, this is an extremely uneconomical procedure. The only viable 
approach to the problems of multi-lingualism in a large scale system,
such as EUDISED, lies in the development of trans-lingual procedures, i.e.
procedures for 'switching' terms from one language to another, either
automatically, or by methods which require only minimal human intervention.

11.2 Translingualism may be attempted at a number of levels :

a) the translingual switching of subject headings in a browsing 
scheme 

b) the translingual switching of terms in a post-co-ordinate system 

c) the translingual switching of pre-co-ordinated strings of terms 

d) the automatic translation of abstracts 

e) the automatic translation of texts 

a) has already been touched on, and will not be discussed further;
e) is a topic which lies outside the scope of this paper. This leaves
b) - d) for further consideration.

11.3 The translingual switching of post-co-ordinate terms can be achieved
in two ways :

a) the direct equivalence approach, based upon a multi-lingual
thesaurus, in which one-to-one equivalences are established
between the terms of each language. This approach has the
advantage of simplicity: the terms of one language are directly
convertible to the terms of another. It brings with it the
disadvantage that, to achieve direct convertibility between terms,
any specific terms in one language which do not have exact
counterparts in all of the other languages are omitted from the
thesaurus. This practice imposes an artificial limit on the
specificity of terms, and, hence, reduces the precision potential
of the thesaurus.

b) the 'switching language' approach, which entails the development
of :

i) a 'switching language' containing a language-independent
'concept number' for each concept indexed (regardless of
whether or not this concept can be expressed specifically in
any particular language)

ii) two types of conversion table for each language 

- a term-to-concept-number conversion table 

- a concept-number-to-term conversion table

It is envisaged that in a decentralised multi-lingual network,
conversion tables would be used as follows :

i) each centre in the network would index in its own
language at the maximum level of specificity

ii) before contributing its records to the network, a centre
would automatically convert all 'local language' terms
to concept numbers, by means of a term-to-concept-number
conversion table

iii) on receiving a batch of records from an external source, a
centre would use a concept-number-to-term conversion table
to convert the concept numbers on the in-coming records
into terms in the 'local language'

The advantage of this approach is that it in no case interferes
with indexing specificity.
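
The use of the two types of conversion table may be sketched in Python as
follows. The concept numbers, the terms and the function names contribute
and receive are invented for illustration. The sketch also assumes that
every term and every concept number is present in the relevant table; the
treatment of a concept which has no specific term in the receiving
language is not settled here, and the sketch simply raises a KeyError in
that case.

```python
# Hypothetical term-to-concept-number tables, one per language.
TERM_TO_CONCEPT = {
    "en": {"school": 1001, "comprehensive school": 1002},
    "de": {"Schule": 1001, "Gesamtschule": 1002},
}

# Hypothetical concept-number-to-term tables, one per language.
CONCEPT_TO_TERM = {
    "en": {1001: "school", 1002: "comprehensive school"},
    "de": {1001: "Schule", 1002: "Gesamtschule"},
}


def contribute(terms, language):
    """Step ii): convert 'local language' terms to concept numbers."""
    table = TERM_TO_CONCEPT[language]
    return [table[term] for term in terms]


def receive(concept_numbers, language):
    """Step iii): convert in-coming concept numbers to 'local language' terms."""
    table = CONCEPT_TO_TERM[language]
    return [table[number] for number in concept_numbers]


if __name__ == "__main__":
    # A German-language centre contributes a record; an English one receives it.
    switched = contribute(["Gesamtschule", "Schule"], "de")
    print(switched)
    print(receive(switched, "en"))
```

Only concept numbers travel between centres, so the addition of a further
language requires two new tables for that language and no change to the
tables of any other centre.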

11.4 The trans-lingual switching of pre-co-ordinate strings of terms presents
special problems. Not only must it be possible to switch each of the
terms in a string from one language to another, but this must be
accomplished without distorting the meaning of the string as a whole.
Much work in this area has yet to be done. A fruitful approach to the
problem seems to lie in the use of a pre-co-ordinate system with a
generalised language-independent syntax (such as the BTI system, or
PRECIS), in conjunction with a switching language.

11.5 So far as is known only the TITUS system can justly claim to be
capable of automatically translating abstracts. The abstracts prepared
for TITUS may be written in any of four languages, English, French,
German and Spanish, but must be phrased in a stylised manner, according
to the rules of a restricted syntax. The system uses a 'switching
language' to convert all abstracts to a series of language-independent
codes. The abstracts are stored in this form but may be processed by
program to give output in any of the four languages previously noted.
There are two questions about the performance of TITUS which remain,
as yet, unanswered :

a) does the restricted syntax demanded by the system seriously
reduce the quality of its abstracts?

b) is the system too costly to operate? - the time required
to write an abstract for TITUS is known to be much greater
than that required for the preparation of a conventional
abstract.

12. A proposed strategy for the provision of subject data on EUDISED records

12.1 Thompson has suggested that the bibliographic analysis of EUDISED
materials might be carried out at several levels. The degree of
analysis appropriate to any particular type of material would be
determined by its 'importance' (as judged by whatever criteria might
be established). Three levels of subject analysis are proposed
below, ranging from Level 1, the most superficial, to Level 3, the
most thorough.

12.2 Level 1 

a) Controlled indexing

All documents would be indexed pre-co-ordinately using a
controlled vocabulary. Typically, a document would be assigned
one string of, perhaps, five or six terms. (Multi-topical
documents would, naturally, receive more than one string). Strings
would be used :

i) to produce visually-readable pre-co-ordinate indexes

ii) as the basis for machine searching

It would not be possible to guarantee high recall in all cases,
because of the relatively low exhaustivity of indexing employed.
A possible subsidiary use for pre-co-ordinate strings would be
in the provision of 'feature headings' to aid browsing (see 10.1).

The system used for pre-co-ordinate indexing should possess :

i) an open-ended thesaurus incorporating the full range of
recall devices. The thesaurus would serve as the basis for
the 'See' and 'See also' references provided in visually-readable
indexes. It would also assist in the construction
of profiles for machine searching.

ii) the ability to provide an index entry under each significant
term in a string

iii) a generalised syntax which : 

- ensures that entries are structured consistently 
so as to promote collocation in pre-co-ordinate 
indexes 

- offers a basis for developing a multilingual facility within
the system. Any such development would naturally draw
heavily on the existing EUDISED multilingual thesaurus,
but the vocabulary of this thesaurus would need to be
extended, and, in some cases, terms would need to be
modified to fit into the framework of a pre-co-ordinate
system

iv) various 'labour-saving' features aimed at minimising indexing
effort, e.g.

- a 'string input' facility (9.7.3 a))

- an efficient mechanism for calling up the network of
'See' and 'See also' references appropriate to any term
(9.7.3 b))

- an efficient mechanism for handling 'recurrent' subjects
(9.7.3 c))

b) Uncontrolled indexing 

The natural language titles occurring in the citations in the database
would be available as additional search keys for machine searching.

c) Classification 

All documents would be assigned one or more class numbers from a
broad enumerative classification (which might be modelled on an
existing classification for the field of education, such as the
London Education Classification). If the suggestion made in 10.3
is accepted, the classification would form one component of an
integrated classification/subject heading system, in which each
class number was equated with several semantically equivalent
subject headings in various languages. The class numbers
assigned to EUDISED records would then be capable of providing
bibliographic tools with two types of browsing facility : a
classified sequence of citations, and a sequence arranged
alphabetically by subject heading. Class numbers might also
be used in machine searching with or without other search terms.
Their purpose would be to restrict a search to a particular class
of record or to identify a particular subset of the database in
preparation for the production of a special purpose bibliography
or current awareness bulletin.
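
The use of class numbers to restrict a machine search, with or without
other search terms, may be sketched in Python as follows. The sketch
assumes a hierarchical notation in which a subclass number begins with
the number of its parent class, so that a class can be selected by prefix;
this assumption, the record layout, and the names restrict_by_class and
search are hypothetical.

```python
def restrict_by_class(records, class_prefix):
    """Keep records with at least one class number beginning with class_prefix."""
    return [
        r for r in records
        if any(number.startswith(class_prefix) for number in r["class_numbers"])
    ]


def search(records, class_prefix=None, terms=()):
    """Optionally restrict by class, then require every search term to be present."""
    pool = records if class_prefix is None else restrict_by_class(records, class_prefix)
    wanted = set(terms)
    return [r["id"] for r in pool if wanted <= set(r["terms"])]


if __name__ == "__main__":
    records = [
        {"id": "D1", "class_numbers": ["B2"], "terms": ["curriculum", "assessment"]},
        {"id": "D2", "class_numbers": ["B25", "C1"], "terms": ["assessment"]},
        {"id": "D3", "class_numbers": ["C1"], "terms": ["assessment"]},
    ]
    # Subset of the database for a special purpose bibliography on class B.
    print(search(records, class_prefix="B"))
    # The same restriction combined with an ordinary search term.
    print(search(records, class_prefix="B2", terms=["curriculum"]))
```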



12.3 Level 2 



At this level, the indexing data assigned at Level 1 would be 'enriched'
by additional terms selected from the vocabulary of the pre-co-ordinate
indexing system.

The 'enrichment' terms (say, 5 or 6 per document) would be chosen so
as to express concepts which were not indexed at Level 1, because of
the low exhaustivity of indexing practised at that level.

For the sake of economy in indexing effort, enrichment terms would not
be pre-co-ordinated into strings, and would, therefore, play no part
in the production of visually-readable indexes. Their sole purpose
would be to increase the indexing exhaustivity, and so improve the
recall of post-co-ordinate machine searches. The database would preserve
the distinction between Level 1 terms and enrichment terms so that the
former could, if necessary, be given a higher weighting for the purposes of
machine searching.

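One way in which the distinction between Level 1 terms and enrichment
terms might be exploited in machine searching is sketched below in Python.
It assumes that the preference for Level 1 terms is expressed as a
numerical weight; the weights of 2.0 and 1.0, the record layout, and the
names score and weighted_search are hypothetical and are offered only as
an illustration.

```python
LEVEL1_WEIGHT = 2.0      # assumed value, for illustration only
ENRICHMENT_WEIGHT = 1.0  # assumed value, for illustration only


def score(record, query_terms):
    """Sum the weights of the query terms found on a record."""
    level1 = set(record["level1_terms"])
    enrichment = set(record["enrichment_terms"])
    total = 0.0
    for term in query_terms:
        if term in level1:
            total += LEVEL1_WEIGHT
        elif term in enrichment:
            total += ENRICHMENT_WEIGHT
    return total


def weighted_search(records, query_terms):
    """Return ids of matching records, highest score first (ties by id)."""
    scored = [(score(r, query_terms), r["id"]) for r in records]
    hits = [(s, rid) for s, rid in scored if s > 0]
    return [rid for s, rid in sorted(hits, key=lambda h: (-h[0], h[1]))]
```

Enrichment terms still contribute to recall, since any match retrieves
the record, while matches on Level 1 terms are ranked first.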


12.4 Level 3 



Abstracts would be prepared for all documents processed at this level, 
so providing a basis for 'free text' searching.